Performance Bounds for Pairwise Entity Resolution

نویسندگان

  • Matt Barnes
  • Kyle Miller
  • Artur Dubrawski
چکیده

One significant challenge to scaling entity resolution algorithms to massive datasets is understanding how performance changes after moving beyond the realm of small, manually labeled reference datasets. Unlike traditional machine learning tasks, when an entity resolution algorithm performs well on small holdout datasets, there is no guarantee this performance holds on larger hold-out datasets. We prove simple bounding properties between the performance of a match function on a small validation set and the performance of a pairwise entity resolution algorithm on arbitrarily sized datasets. Thus, our approach enables optimization of pairwise entity resolution algorithms for large datasets, using a small set of labeled data.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

The Effect of Transitive Closure on the Calibration of Logistic Regression for Entity Resolution

This paper describes a series of experiments in using logistic regression machine learning as a method for entity resolution. From these experiments the authors concluded that when a supervised ML algorithm is trained to classify a pair of entity references as linked or not linked pair, the evaluation of the model’s performance should take into account the transitive closure of its pairwise lin...

متن کامل

On Pairwise Kernels: An Efficient Alternative and Generalization Analysis

Pairwise classification has many applications including network prediction, entity resolution, and collaborative filtering. The pairwise kernel has been proposed for those purposes by several research groups independently, and become successful in various fields. In this paper, we propose an efficient alternative which we call Cartesian kernel. While the existing pairwise kernel (which we refer...

متن کامل

Cartesian Kernel: An Efficient Alternative to the Pairwise Kernel

Pairwise classification has many applications including network prediction, entity resolution, and collaborative filtering. The pairwise kernel has been proposed for those purposes by several research groups independently, and has been used successfully in several fields. In this paper, we propose an efficient alternative which we call a Cartesian kernel. While the existing pairwise kernel (whi...

متن کامل

Corpus based coreference resolution for Farsi text

"Coreference resolution" or "finding all expressions that refer to the same entity" in a text, is one of the important requirements in natural language processing. Two words are coreference when both refer to a single entity in the text or the real world. So the main task of coreference resolution systems is to identify terms that refer to a unique entity. A coreference resolution tool could be...

متن کامل

An Active Learning Approach to Coreference Resolution

In this paper, we define the problem of coreference resolution in text as one of clustering with pairwise constraints where human experts are asked to provide pairwise constraints (pairwise judgments of coreferentiality) to guide the clustering process. Positing that these pairwise judgments are easy to obtain from humans given the right context, we show that with significantly lower number of ...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:
  • CoRR

دوره abs/1509.03302  شماره 

صفحات  -

تاریخ انتشار 2015